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Abstract: The Expectation-Maximization (EM) algorithm (Dempster, Laird 
and Rubin, 1977) is a popular method for computing maximum likelihood 
estimates (MLEs) in problems with missing data. Each iteration of the al- 
gorithm formally consists of an E-step: evaluate the expected complete-data 
log-likelihood given the observed data, with expectation taken at current pa- 
rameter estimate; and an M-step: maximize the resulting expression to find 
the updated estimate. Conditions that guarantee convergence of the EM se- 
quence to a unique MLE were found by Boyles (1983) and Wu (1983). In 
complicated models for high-dimensional data, it is common to encounter an 
intractable integral in the E-step. The Monte Carlo EM algorithm of Wei and 
Tanner (1990) works around this difficulty by maximizing instead a Monte 
Carlo approximation to the appropriate conditional expectation. Convergence 
properties of Monte Carlo EM have been studied, most notably, by Chan and 
Ledolter (1995) and Fort and Moulines (2003). 

The goal of this review paper is to provide an accessible but rigorous in- 
troduction to the convergence properties of EM and Monte Carlo EM. No 
previous knowledge of the EM algorithm is assumed. We demonstrate the im- 
plementation of EM and Monte Carlo EM in two simple but realistic examples. 
We show that if the EM algorithm converges it converges to a stationary point 
of the likelihood, and that the rate of convergence is linear at best. For Monte 
Carlo EM we present a readable proof of the main result of Chan and Ledolter 
(1995), and state without proof the conclusions of Fort and Moulines (2003). 
An important practical implication of Fort and Moulincs's (2003) result relates 
to the determination of Monte Carlo sample sizes in MCEM; we provide a brief 
review of the literature (Booth and Hobcrt, 1999; Caffo, Jank and Jones, 2005) 
on that problem. 



1. Introduction: The Monte Carlo EM algorithm 

The expectation-maximization, or EM algorithm, is an algorithm for maximizing 
likelihood functions, especially in the presence of missing data. When EM works, 
the algorithm's output is a sequence of parameter values that converges to the 
maximum likelihood estimate (MLE). The seminal paper on EM, and that which 
gave the algorithm its name, is the article by Dempster, Laird and Rubin (1977). 
A book length treatment is given by McLachlan and Krishnan (1997). 

Consider a statistical model in which the random vector (Y,U), Y £ M. N and 
U £ IR 9 , has distribution given by /(y, u; 9), a density with respect to the measure 
A x /i, where A and fi are measures on and M. q respectively, and indexed by 
the unknown parameter 9 £ O. We refer to (Y, U) as the "complete data" but only 
Y = y is observed; U represents the unobserved or "missing" data. The MLE of 9 
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is the value 9 which maximizes the likelihood function 

(1) L{6;y) = J f(y,u;6)fx(du) 

or, equivalently, the log likelihood l(0;y) — logL(9;y). The EM algorithm can be 
used to find 9 even if the integral in (1) is intractable. Define the Q-function, a 
mapping on 6 x 6, by 

(2) Q(0\6;y) =E{log f(y,U;9) \ y,6~} , 

that is, the expected value of the "complete data" log-likelihood at 9, given the 
observed data, this conditional expectation evaluated under 9. Each EM iteration 
formally consists of an E-step, to evaluate the conditional expectation in (2), and 
an M-step, to maximize it. More precisely, if 9^ is the parameter value as of the 
tth iteration, the update 6K t+1 ) is chosen such that Q(0( t+1 >|0W; y) > Q{9\9^;y) 
for all 9 e 9. Under regularity conditions (Boyles, 1983; Wu, 1983, and see Sec- 
tion 3 below), and given a suitable starting value 0(°\ the resulting sequence 
: t = 0, 1, . . .} will converge to a local maximizer of L. 
If the integral in (2) admits a closed form solution, the implementation of EM 
is straightforward (though the M-step may still require a numerical optimization 
scheme such as Newton- Raphson) . Suppose it does not. As noted, the evaluation 
of (2) requires taking an expectation with respect to the conditional distribution 
of the missing data U, given observed data Y = y. If one has the means to sim- 
ulate random draws from this target distribution, the Q-function can be approxi- 
mated by Monte Carlo integration. Let . . . , denote a random sample from 
h(u\y; 9) = f(y, u; 9)/L(9; y). Then a Monte Carlo approximation to (2) is given by 

m 

Q m (9\9;y) = - V log /(y, M « ; 0) . 
m z — ' 

fc=i 

In the Monte Carlo EM algorithm (MCEM), first introduced by Wei and Tanner 
(1990), the update 9 (t+1) is the value of 9 that maximizes Q m (0|0 (t) ; y). 

Applications of EM and MCEM have been numerous; in this work we focus on 
one in particular, the two-stage hierarchical model, introduced in Section 2. We give 
two simple but realistic examples from this class of models, and demonstrate the 
implementation of EM and MCEM in those two problems. In Section 3 we discuss 
convergence properties of the EM algorithm. Of course, the question of convergence 
for MCEM is far more complicated, and an accessible discussion of the major results 
in this area is the main objective of this review paper. In Section 4 we provide a 
rigorous but accessible review of the two seminal papers on MCEM convergence, 
those of Chan and Ledolter (1995) and Fort and Moulines (2003). We make some 
concluding remarks in Section 5. 

2. Application: The two-stage hierarchical model 

Let Y = (Yi, . . . , Y/v) T , where each Yi is a random variable in R 1 , denote the ob- 
servable data. In a two-stage hierarchical model, the distribution of Y is specified 
conditionally on some unobservable random quantity U — (Ui, ■ ■ ■ ,U q ) T . Specifi- 
cally, we assume that conditional on U = u, the Yi are independent with conditional 
densities denoted by fi{yi\Ui] 9\), where 9\ € 0i is an unknown parameter and each 
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fi is a density with respect to Lebesgue or counting measure. The /, may also de- 
pend on an observable covariate Xi though this dependence is suppressed in our 
notation. Define f{y\u; 6\) — Y\.i=x fi(Ui\ u i ^1); a density on R*, and this completes 
specification of the first level, or stage, of the hierarchy. At the second stage we 
specify a marginal distribution for U, defined by h(u;02), a density on W 1 that 
depends on the unknown parameter 62 € 02- Assume the parameter spaces Oi and 
02 are open subsets of M. dl and R d2 , respectively. Let d — di + g?2- The unknown 
parameter 9 = (0\, 9 2 ) lies in the parameter space 9 = 61 x 62, an open subset of 



Suppose we wish to compute maximum likelihood estimates (MLEs) of 9\ and 
92 ■ Were the random effects U observable the likelihood function would be given by 
what we will call the complete data likelihood L c (9;y,u) = f(y\u; 9\)h{u] ^2)- But 
since only the data Y are observed, the random effects must be integrated out of 
Lc yielding the likelihood function 



We wish to find the value of 9 that maximizes L, that is, the MLE 9. 

It will most often be the case that the integral in (3) is intractable. Booth, 
Hobert and Jank (2001) provide a very nice summary of numerical and Monte 
Carlo methods available for maximum likelihood in this problem, arriving at the 
conclusion that "Monte Carlo EM is generally the simplest and most efficient Monte 
Carlo fitting algorithm for two-stage hierarchical models." As noted above, the EM 
algorithm is a general method for maximum likelihood in the presence of missing 
data; hierarchical models are cast in this light by viewing the unobserved random 
effects as "missing". 

Let l c = log L c denote the complete data log likelihood, so 



Thus in the setting of hierarchical models, the EM update rule introduced in Section 
1 can be written 



that is, the update of 9\ and that of 62 can be considered separately. 

If one or both of the expectations in (4) is intractable, one might employ the 
Monte Carlo EM algorithm. The MCEM update rule for the two-stage hierarchical 
model is given here. Let 9^ = (9f \ 9^p) denote the current parameter value; then 
6»(* +1 ) is found by 

1. Simulate random sample from the conditional density 



(3) 




l c (0\ y, u) = log f(y\u; 9 X ) + log h{u; 9 2 ) . 



(4) 




2. Compute updates 



ft 




2 



= argmax 



= arg max 
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The "target density" for the Monte Carlo E-step (step 1) is the conditional density 
of the random effects given the data, 

(5) h(u\y;d) cx f{y\u;6 l )h{u;e 2 ) . 

If direct simulation from (5) is impossible, one might resort to a Markov chain Monte 
Carlo (MCMC) method such as the Metropolis-Hastings algorithm. In this case 
the sample \yS t,k ^ : k = 1, . . . , m} is an ergodic Markov chain having h(u\y; 9^) 
as its unique stationary density (see, for example, Robert and Casella, 2004). An 
alternative approach is to compute a quasi-Monte Carlo or randomized quasi-Monte 
Carlo (L'Ecuyer and Lemieux, 2002) approximation to the Q-function with the goal 
of reducing Monte Carlo error and hence increasing the efficiency of the algorithm. 
We will not consider quasi-Monte Carlo methods any further in this report; the 
interested reader is referred to Jank (2004). 

2.1. Example 1: A linear mixed model 

Table 1 contains a data set for an experiment described by Snedccor and Cochran 
(1989). The experiment involved six bulls and very many cows. From each bull, 
some number of semen samples was taken, and each of these samples was used 
in an attempt to artificially inseminate a large number of cows. Some attempts 
were successful and some were not; let Yj,j denote the success rate (percentage of 
conceptions) for sample j from bull i, for j = 1, . . . , n.j and i = 1, . . . , q = 6; here 
N = Y2i=i n i- Consider the one-way random effects model 

Vij = fi + Ui + &ij 

where fi is the overall mean, Ui is the ith bull effect, and e,-j is a residual error term. 
As the six bulls were a random sample from a larger population of bulls, the Ui are 
modeled as independent and identically distributed (i.i.d.) random effects. Model 
specification is completed by a distribution assumption on the bull effect and error 
term; we take 

Ui ~ iid Normal (0, cfj) ; independent of ~ iid Normal (0, a 2 ) . 

When there exists a conjugate relationship between / and h, as in the normal linear 
mixed model, the integral in (3) can be solved explicitly. The resulting log-likelihood 
can be maximized numerically (or analytically in the case of balanced data rii = n) ; 
for the bulls data we obtain fi = 53.318, a 2 u = 54.821, and a 2 = 249.23. 



Consider the EM algorithm. We find it more convenient to work with an equiva- 
lent version of the model in which = m + eij and the Ui are i.i.d. Normal(^t, a 2 ). 
Under this reparameterization the complete data log- likelihood of 9 = (/x, a 2 , a 2 ) is 

N 1 q n< 1 q 

l c (9- y, u) = --\og{al) - — 2 £ - u t f - \ log(^) _ . 

°e i=l j=l °" i=l 



Owing to the conjugacy it is straightforward to show that 
(6) Ui\(Y = y;9) i = 1, . . . ,q are indep Normal 



a 2 n + riialyi 
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Bull (i) 



1 
2 
3 
4 
5 
6 

Total 



5 
2 
7 
5 
7 
9 
35 



Percentage of conception 



46, 31, 37, 62, 30 
70, 59 

52, 44, 57, 40, 67, 64, 70 

47, 21, 70, 46, 14 

42, 64, 50, 69, 77, 81, 87 

35, 68, 59, 38, 57, 76, 57, 29, 60 

Table 1 



Bovine artificial insemination data of Example 1 (Snedecor and Cochran, 1989). 



Denote the conditional mean and variance of Ui given Y = y by iii and Vi, respec- 
tively. Then the EM update rule is given by 



(t+i) - 



-.(*) 



It' 

i=l 



i=l 
1 



-,(t) 



,(*+!) 



N 



E 



E 

i=i 



Vii 



2n l y l u ( p + n t ( vf } + 



-At) 



Given the existence of a closed form EM update, there is no practical reason to 
resort to Monte Carlo EM (indeed there was no practical need for EM, as we found 
a closed form expression for the likelihood as well), but we will consider MCEM 
for illustration. Let u^' 1 ', . . . , denote a sequence of simulated draws from 

h(u\y;6^), given at (6). The MCEM update rule for 6»(* +1 ) is 



,(*+!) 



-EE^' fc) 

mq ^— ' ^ 1 

H fe=l i=i 

y i=l 

jvEEE(y«-^ 



mq 



mN 

k=l i=l j = l 

We ran three independent MCEM runs of 20 iterations each, starting at the point 
(//°) , cr^ <0) , crf 0> ) = (55, 45, 260). For each update we used Monte Carlo sample size 
to = 10 4 ; results are shown in Figure 1. The three dashed lines indicate the paths 
of the three MCEM runs, and the solid line shows that of ordinary (deterministic) 
EM. We did three more runs with starting values closer to the MLE and using 
to = 10 5 ; those results are summarized in Figure 2. 
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Overall mean 




10 

Iteration t 



Between-bulls variance component 




10 

Iteration t 




Fig 1. Trace plots for Monte Carlo EM in Example 1, based on Monte Carlo sample size m = 
10 4 at each iteration. Top left plot is overall mean {i, top right and bottom left are variance 
components o^ and a\, respectively. Bottom right plot shows log-likelihood evaluated at current 
parameter value. The solid line is deterministic EM and the three dashed lines correspond to three 
independent runs of Monte Carlo EM. 
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Overall mean 



Between-bulls variance component 





Within-bull variance component 



Log-likelihood 





Fig 2. Analogous to Figure 1, but with m = 10 5 and starting values chosen closer to the true 
MLE. 



imsart-coll ver. 2011/11/15 file: Neath.tex date: March 8, 2013 



8 



2.2. Example 2: A logit-normal generalized linear mixed model 

Let Y = {Yij : j = 1, . . . , nf, i = 1, . . . , q} denote a set of binary response variables; 
here again one can think of YL- as the jth response for the ith subject. Let Xij be a 
covariate (or vector of covariates) associated with the i, j observation. Conditional 
on the random effects [/ = «£!', the responses are independent Bernoulli^,, ■) 
where 



% 3 ' 1 ' 

2 \ 



Let Ui,...,U q be independent and identically distributed as Normal (0, <r ). The 
likelihood is given by 

L(/3,a 2 ; 2/ ) = ( < 7 2 )- 9/2 x 



/ ex P | E E [f« + ^ lo § C 1 + - i E u ? [ 



du . 



The above model has been used by several authors (Booth and Hobert, 1999; Caffo, 
Jank and Jones, 2005; McCulloch, 1997) as a benchmark for comparing Monte Carlo 
methods of maximum likelihood. We consider here a data set generated by Booth 
and Hobert (1999, Table 2) with m = 15, q = 10, and Xij = j/15 for each For 
these data the MLEs are known to be 0, a 2 ) = (6.132, 1.766). 
A version of the complete data log-likelihood is given by 

i q q m 

IM a 2 - Vl u) = - q 2 log (a 2 ) - — u* + E [fciVU ~ lo S i 1 + e^ +Ui )} . 

i=l i=i j=i 

To apply the EM algorithm in this problem we would need to compute the (condi- 
tional) expectation of l c with respect to the density 

!<? n« 1 9 1 

J2 E [y^ - lo § i 1 + e pxii+Ui )] - E u "\ ■ 
i=i j=i i=i j 

Clearly this integral will be intractable. Thus we consider a Monte Carlo EM algo- 
rithm, which requires the means to simulate random draws from the distribution 
given by (7). McCulloch (1997) employed a variable-at-a-time Metropolis-Hastings 
independence sampler with Normal(0, a 2 ) proposals, which Johnson, Jones and 
Neath (2011) have shown is uniformly ergodic. 

Trace plots for three independent runs of MCEM are shown in the left hand 
panels of Figure 3. The starting values for these runs were (/3(°\ a 2 ^) = (2, 1), and 
we ran 35 updates with Monte Carlo sample size m — 10 4 at each iteration. We 
conducted three more runs of 25 iterations, starting at (/3^°\ er 2 ^ ') = (6,2), with 
m = 10 5 ; results are shown in the right hand panels of Figure 3. 
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Monte Carlo EM trace plots for beta 



Monte Carlo EM trace plots for beta 




Monte Carlo EM trace plots for variance 



Monte Carlo EM trace plots for variance 



CM ^ 
I 



10 15 20 25 30 35 

Iteration t 



Fig 3. Monte Carlo EM trace plots for logit-normal model of Example 2. Top panels show /3, 
bottom panels show a 2 . Three dashed lines correspond to three independent runs of MCEM, with 
solid horizontal line drawn at true MLE. Runs in left hand panels used Monte Carlo sample size 
m = 10 4 at each iteration; in right hand panels we used m = 10 5 with starting values closer to 
the true MLE. 



3. Convergence properties of ordinary EM 

The basic convergence properties of the EM algorithm were established by Boylcs 
(1983) and Wu (1983). The presentation given here draws heavily from Geyer 
(1998). We will show that if an EM sequence converges, its limit must be a sta- 
tionary point of the log-likelihood. We then present conditions that guarantee the 
convergence of EM, with additional conditions that guarantee convergence to the 
MLE. We conclude this section with a proof that the EM algorithm cannot produce 
a superlinearly convergent sequence. 

We begin by proving the ascent property of the EM algorithm, which guarantees 
that an EM update will never decrease the value of the likelihood function, that is, 
if {flW} is an EM sequence, then l{9 ( - t+1 '>;y) > l(9^;y) for each t. 

Define 

R{9\9;y) = v{\ogh{U\y;9)\ y;§} 
(8) = E [log f(y,U;9)\ y; 9~} - E {log f(y; 9)\ y;0} 

= Q(6\9-y)-l(9;y) . 
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We now show that, for fixed 9, R(9\9; y) attains its maximum at 9 = 9. 

Lemma 1. For any 9 E 9, R(9\9; y) > R(9\9; y) for all 9. 

Proof. 



R(9\9;y)~R(9\9;y)=E log 



(h(U\y;9) 



< log E 



\h(U\y;9) 



\h(U\y;9) 

by the conditional Jensen inequality (see Billingslcy, 1995, page 449); now 



E 



f h(U\y;9) 
\h(U\y;9) 



h(u\y;9) 

and thus R(6\6; y) - R{6\6; y) < log(l) = 



^ J-h(u\y;9)du = I h{u\y\9)du 



□ 



Theorem 1. IfQ(9\9;y) > Q(0\0;y), thenl(0;y) > l(9;y). IfQ(9\9;y) > Q(9\9;y), 
then 1(9; y) > 1(9; y). 



Proof. By (8) and Lemma 1, 

l(9;y)-l(9;y) = Q(9\9;y)-Q(9\9;y) 
>Q(9\9;y)-Q(9\9;y) 



R(9\9;y) - R(9\9;y) 



□ 



The ascent property of EM follows immediately from Theorem 1: since 
chosen to maximize Q(9\9^;y), it must be that Q(9 { - t+1 ^\9^; y) > Q(9^\9^;y) 
and thus l(9^ t+1 ^; y) > 1(9^; y). This is an appealing property, as it guarantees that 
an EM update will never take a step in the wrong direction. Of course, this result 
tells us absolutely nothing about the convergence of an EM sequence. 

We now show that if an EM sequence converges, it converges to a stationary 
point of the log-likelihood. Unless otherwise noted, V will denote differentiation 
with respect to the first argument. 

Theorem 2. Suppose the mapping (9,9) t-¥ \7Q(9\9;y) is jointly continuous. If 9* 
is the limit of an EM sequence {#^}, then X7l(9*;y) = 0. 

Proof. Since 6^ +1 ) maximizes Q(6\6 w ;y) at each t we have VQ(9 ( ~ t+1 ^\9^;y) = 
at each t. By the continuity assumption VQ(9 ( - t+1 '>\9^ ;y) VQ(9*\9*;y) as* oo 
and thus WQ(9*\9*;y) = 0. Let R be as defined at (8), and note 



VR(0\0;y) = 



0_ 
00 



\ogh(u\y;0) 



h(u\y; 0)du 



m h (u\y,9) 

h(u\y;0) 



0_ 
00 



h(u\y; 0)du 




h(u\y; 9)du = ^(1)=0 . 



It then follows from (8) that 

Wl(9*;y) = WQ(9*\9*;y) = 



□ 
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From Theorem 2 we have that if the EM algorithm converges, it converges to a 
stationary point of I; we as yet have no guarantee that EM converges. By the ascent 
property, the limit of an EM sequence (if it exists) cannot be a local minimum. It 
can, however, be a local but not global maximum (Wu, 1983, cites several examples) 
or a saddlepoint (Murray, 1977, gives an example). 

We will now specify conditions that do guarantee the convergence of the EM 
algorithm. We define a generalized EM sequence as one in which each update in- 
creases the Q"fu nc t,ion, font does not necessarily maximize it. 

Definition 1 . A generalized EM ( GEM) sequence is a sequence of parameter values 
{6>W} satisfying Q(0(* +1 )|0W; y) > Q(9^\9^;y) for each t. 

It is immediately clear from Theorem 1 that a GEM sequence enjoys the ascent 
property l(9^ t+1 ^; y) > Z(0W; y). The conclusion of Theorem 2, that the limit of an 
EM sequence (if it exists) must be a stationary point of /, does not hold for GEM 
without additional assumptions. 

Consider a sequence of parameter values {0^} satisfying #( t+1 ) g M(0W) for 
some point-to-set mapping M. For example, a GEM sequence can be formulated 
in this manner by taking M(6) = {fl : Q(6\6;y) > Q{9\0; We will indicate a 
point-to-set mapping M in 9 by the notation M : 9 =t 9. 

Definition 2. The point-to-set mapping M : 9 =4 9 is outer semicontinuous if the 
graph of M, 

{(0,0) e 9 x 9 : e M(0)} 

is a closed set; that is, if for any convergent sequence {(0C 4 ),0«)} satisfying 6^ > e 

M(§®) for each t, the limit (0*,0*) satisfies 9* G M(0*). 

The following theorem gives a set of conditions under which every cluster point 
of a GEM sequence lies in a particular set T C 9. 

Theorem 3. Let T C 9 and M : 9 =4 9 6e smc/i i/iai ifte following conditions 
hold. 

1. M{6) c {0 : Q(0\6;y) > Q(0|0;y)} w/ien 8 eT. 

2. M(0) c {0 : Q(9\9;y) > Q(0|0;y)} io/ien G 9 \ T. 

3. The restriction of M to 9 \ T is outer semicontinuous. 

Further suppose that the log-likelihood I is continuous, that the level set 
{0 : l(6;y) > 1(0^°'; y)} is compact, and let the sequence {0« :i = 0,l,2,...} be 
such that 0( t+1 ) G M(9^) for each t. Then l(9^;y) converges to a limit, and every 
cluster point of |0(*H is contained in T. 

Proof. By assumption the log-likelihood is bounded above. Also, {0^ t - 1 } is a GEM 
sequence, hence 1(9^ ;y) is nondecreasing, so it converges to a limit A. 

Suppose to get a contradiction there exists a subsequence 9^^ — > 0* ^ T. Con- 
sider the subsequence {0'* fc+1 )}. By the ascent property Z(0' tfc+1 '; y) > 1(9^; y) for 
each k, so the compactness assumption guarantees that {0( tfc+1 )} has a convergent 
subsequence with limit 9**. Further, 0** G M(9*) by the outer semicontinuity of 
M, and thus Q(9**\9*;y) > Q(9*\9*;y) and thus l(6**;y) > 1(9* ;y) by assumption 
2 and Theorem 1, respectively. But 1(9** ;y) = A = 1(9* ;y) by continuity of I, a 
contradiction. 

Thus all cluster points of {0^*H are in T. □ 
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In the obvious application of Theorem 3 the solution set T is taken to be the set 
of stationary points of the log-likelihood. We now have a set of conditions under 
which the EM algorithm is guaranteed to converge to the unique MLE 9. 

Corollary 1. If the conditions of Theorem 3 hold and the set V consists of a single 
point 9, then the sequence converges to 9. 

Unfortunately, these conditions can be difficult or impossible to verify in many 
practical applications. Further, the rate of convergence of the EM algorithm cannot 
be supcrlincar, as we show here. 

Definition 3. The sequence {0'*'} converging to 9 is said to converge superlinearly 
if 



t+l 



as t — fr oo, where 



- y = o i 

denotes the standard Euclidean norm. 



Lemma 2. Suppose the log-likelihood is twice continuously differ entiable with a 
local maximum at 9 and suppose that ~S7 2 l{9;y) is nonsingular and negative def- 
inite. Further supose that \ 7 Q(8\8;y) is nonsingular and negative definite and 
\7 2 Q(9\9;y) — \7 2 l(9;y) is nonsingular. Define the sequence by 



(9) 



0(t+l) = 0« _ [v 2 Q(0«|0W;y)l VQ(9^\9^-y) 



and suppose that 8^' 



Then the convergence is not superlinear. 



Proof. Let Snr denote the Newton-Raphson update increment for the optimization 
of I, that is, if } is a Newton-Raphson sequence then 9 /( - t+1 ~> = 8'^+5 NR (9'^) 
for each t, or 



(9) = -[\7 2 l(9;y)} 'vi^y) 



Since W 2 l(8; y) is continuous and W 2 l(9; y) is nonsingular, it must be that V 2 l(9; y) 
is invertible in a neighborhood of 9, and thus 5nr(9^) is well-defined for sufficiently 
large t. 

By convergence of {#(*H and the continuity of VI, Vl(9^;y) — fr Vl(8;y) = 0. 
Together with the continuity of W 2 l(9; y), this guarantees that 



5 N R(0 {t 



V 2 l{9^-y)} Vl(9^-y)^\v 2 l{9-y) 



= 



as t — fr co. Now, consider the sequence {yi{9^;y)/\\Vl{9 t ^ > ]y)\\\. This sequence 
lives on the unit sphere, a compact set, and hence has a convergent subsequence. 
Let {t^} denote the indices of a convergent subsequence and b its limit. Then 
(10) 

0(t*+D_0(t*) _ - [V 2 Q(8^\8^;y)] 1 Vl(0<**>;i/) 



|VI(0(**);i/)|| 



\\Vl(8(^;y)\\ 



-fr- 



aud 



(11) 



W0 (ffc) ) ~[V 2 l(0^;y)] 'W(g (tt) ;y) 

\Vl(9«*);y)\\ ||VZ(0(*fc);y)|| 



V 2 Q{9\9;y) 



V 2 l{9y) 



as k — fr co. The equality in (10) follows from the fact that VQ(6\8;y) — \ 7 l(9;y) 
for any 9. 
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Suppose the sequence does converge superlinearly. Then it is asymptoti- 

cally equivalent to Newton-Raphson by the Dennis-More characterization theorem 
(see, for example, Fletcher, 1987), and thus the (sub)sequences defined in (10) and 



(11) must have the same limit. Then V 2 Q(6>|6>; y) b = V 2 Z(6»;y) 



-l 



-l 



b = c. So 



V 2 Q(e\6-y)~V 2 l{6-y) 



c = 



and thus c — since V 2 Q(9\9; y) — W 2 l(9; y) is full rank. But b must be on the unit 
sphere, a contradiction. 

Thus the convergence of {#(*H to 9 is not superlinear. □ 

The algorithm defined at (9), with update rule given by a single Newton-Raphson 
iteration toward the maximum of the Q-function, was first introduced by Langc 
(1995) and is known as the EM gradient algorithm. Details are beyond the scope of 
this report, but roughly speaking, the convergence properties of the EM algorithm 
are equally enjoyed by Lange's (1995) EM gradient algorithm. Thus while Lemma 
2 takes the convergence of the EM gradient sequence as a given, there is no sacrifice 
in the applicability of the result, as the EM gradient converges to a local maximum 
under essentially the same conditions as does the EM algorithm. 

Theorem 4. Suppose the EM sequence converges to a point 9* £ 0, a 

stationary point of the log-likelihood. Further suppose that l(9;y), Q(9\9;y), and 
R(9\9;y) are twice continuously differ 'entiable in9 and thafV 2 l(9* ; y) , ~S7 2 Q(9*\9* ;y) , 
and V 2 R{6* \9* ; y) have full rank. Then the convergence cannot be superlinear. 

Proof. Let Seg denote the EM gradient update increment, that is, if {fl^*- 1 } is an 
EM gradient sequence then 6»'( t+1 ) = 0'W + S EG (9'^) for each t: 

Seg(0) = - [V 2 Q{9\9-y)Y 1 VQ{9\9-y) . 

Define 8 em analogously, so 9 + o~eg{6) represents the first iteration in a Newton- 
Raphson routine starting at 9 and converging to 9+8em{9). Since Newton-Raphson 
converges superlinearly in this subproblem (see, for example, Fletcher, 1987, The- 
orem 3.1.1), we have 

+ 8 EG (0) -[9 + S EM (0)} = o (||<W(0)ll) 



or 



Seg(0) = 5 EM (9) + o(\\8 E m(9)\\) 



and thus the EM gradient algorithm (9) is asymptotically equivalent to the EM 
algorithm. But EM gradient is not superlinearly convergent by Lemma 2, and thus 
neither is the EM algorithm. □ 



4. Some convergence results for Monte Carlo EM 

It seems a statement of the obvious (and an understatement at that) to point out 
that the study of convergence properties of Monte Carlo EM is more complicated 
than that of ordinary EM. Even before coming to face the complexity of the math- 
ematical arguments, one must determine which notion of "convergence" one wishes 
to consider - what exactly is going to infinity? We mention here three distinct 
approaches to the problem. 

imsart-coll ver. 2011/11/15 file: Neath.tex date: March 8, 2013 



11 



The first serious effort in establishing convergence properties of MCEM is that 
of Chan and Ledolter (1995), who treat the data as fixed, and hold the Monte Carlo 
sample size m constant across MCEM iterations. They then let m go to infinity, 
and study the asymptotic properties of the MCEM sequence as a Monte Carlo 
approximation to the ordinary EM sequence with the same starting value (whose 
convergence properties are well understood). We will discuss Chan and Ledolter's 
(1995) results in considerable detail in subsection 4.1. On the other hand, unless 
the Monte Carlo sample size is allowed to increase with the iteration count, there 
is no chance for convergence in the usual sense (convergence to the MLE) because 
of persistent Monte Carlo error. 

In the version of MCEM considered by Sherman, Ho and Dalai (1997), the Monte 
Carlo E-step is carried out by running multiple (independent) Markov chains gener- 
ated by a Gibbs sampler. Their theoretical results are built on allowing the number 
of chains, the length of each chain, and the number of EM iterations T to all tend 
to infinity, as does the data sample size N. They then prove vTV-consistency and 
asymptotic normality of the estimator 8^ T \ In other words, Sherman, Ho and Dalai 
(1997) found conditions under which the MCEM approximation to the MLE enjoys 
the same asymptotic properties as the MLE itself. This represents yet another pos- 
sible notion of "convergence" of MCEM, though not one that we will pursue any 
further in the present paper. 

Fort and Moulines (2003) treat the data as fixed, the Monte Carlo sample size 
as increasing (deterministically) across MCEM iterations, and establish a.s. conver- 
gence of the sequence as the iteration count goes to infinity. We consider this the 
strongest known result on the asymptotic properties of MCEM, as this notion of 
convergence seems the most consistent with that of ordinary (deterministic) EM. 
We summarize Fort and Moulines (2003) main conclusions in subsection 4.2. 

4-1- A result of Chan and Ledolter (1995) 

Chan and Ledolter (1995) showed that, given a suitable starting value, a sequence 
of parameter values generated by the Monte Carlo EM algorithm will get arbitrarily 
close to a maximizer of the observed likelihood with high probability. Their main 
result is given as Theorem 5 below. We first establish one more convergence property 
of deterministic EM, also attributable to Chan and Ledolter (1995). 

Let Mem ■ © —> © denote the mapping given by the deterministic EM update 
rule, that is, Mem(9) = argmaxQ(0|0; y). 

Lemma 3. (Lemma 1 of Chan and Ledolter, 1995) Suppose 9* is a local maximizer 
of the log-likelihood 1(9; y), a continous function of 9, and that there exists a neigh- 
borhood in which 9* is the only stationary point. Then for any neighborhood Af of 
9* , there exists a neigborhood Af* such that an EM sequence {6^' : t = 0, 1, 2, . . .} 
started at any 8 {0) G Af* , satisfies (i) 6® G TV for all t = l,2,...; and (ii) 9^ -t 9* 
as t — > oo . 

Proof. Let TV be a neighborhood of 9*. There exists a compact, connected sub- 
neighborhood TV* C Af such that (i) 1(8; y) attains its maximum over Af* at 9*, 
(ii) Af* contains no other stationary points of I, and (ii) there exists e > such 
that 1(9; y) > 1(9* ;y) — e for all 8 e Af* . It follows from these conditions that 
M EM (9) € Af* for any 8 e Af*; thus an EM sequence with 0(°) G Af* 

satisfies 9® G TV*, and thus 6® G TV, for all t = l,2,.... 
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Continue to assume that 9^ G Af* and consider the EM sequence {6K*)}. Now 
the sequence |Z(6*'*';y)} is nondecreasing and bounded above by l(9*;y), and thus 
converges to a finite limit, call it A. The sequence j^*-** 1 } lives in Af* , a compact set; 
let {#(* fc )} be a convergent subsequence and denote its limit by 9** e A/"*. 

Suppose 0** ^ 6*. Then Z (6>^ f fc+1 > ; y) -> l(M EM (9**);y) > l(9**;y) = A. That is, 
the subsequence [l(9^ tk+1 ^; y)} converges to a limit greater than A, a contradiction. 

Thus any convergent subsequence of {f^*- 1 } must converge to 6*; thus {f?W} 
converges to 9* . □ 

In the terminology of the stability theory of dynamical systems (see, for example, 
Arrowsmith and Place, 1992, section 3.5), the lemma asserts that an isolated local 
maximizer 9* of 1(9; y) is an asymptotically stable fixed point for the EM algorithm. 
Practically, Lemma 3 tells us that an EM sequence with a sufficiently close starting 
value will remain arbitrarily close to 9* (by stability) as well as converge to 9*. 

Theorem 5. (Theorem 1 of Chan and Ledolter, 1995). Let {6*^} denote a Monte 
Carlo EM sequence based on Monte Carlo sample sizes m t = m, and suppose 
that the MCEM update A4 m (9) := arg max Q m (9\9; y) converges in probability to 
Mem (9) as m — > oo . Further suppose that this convergence is uniform on compact 
subsets of 0. Let 9* be an isolated local maximizer ofl(9;y), a continous function 
of 9. Then there exists a neighborhood of 9* such that for any starting value 9^ in 
that neighborhood and for any e > 0, there exists Tq such that 

(12) Pr{||6> (t) - 6»*|| < e for some t < T | -> 1 
as the Monte Carlo sample size m —¥ oo. 

Proof. Let Af be the set defined as Af* in the proof of Lemma 3, so that Af is 
compact and connected, contains 9*, and Mem (9) G Af for any 9 € Af. For any 
e > 0, we will find T such that (12) holds for any 0<°) G TV. 

Let e > be given. First, there exists a positive number s\ < e such that 
Afi := {6 G Af : \\9 - 0*\\ > £i} is nonempty; if 9 G Af%, then M EM {9) ± 9. By the 
ascent property and by continuity of I in 9 there exist 6, 5i > such that for any 
9 e A/i, if \\0' - Mem(9)\\ < 5, then l(9':y) - l{9;y) > S,. 

By construction of Af, there exists 62 > such that for any 9 £ Af, any 9' with 
1 1 & — Mem (0)| I < <$2 is also in Af . Without loss of generality we can take 82 < 5. 
Thus we have that for any 9 G Ai, any 9' with ||6" - Mbm(0)|| < 82 is also in Af 
(though not necessarily in Afi) and 1(9'; y) — 1(9; y) > Si. Let 

(13) i?= sup {l(9;y)-l(9';y)} < cx) 

8,6'eAf 

and let T = [R/Si\ + 1, where |_-J denotes the greatest integer function. 

Now, suppose an element of the MCEM sequence 9^ — 9 e Af. Then the 
probability that its MCEM update 9 {t+1 ) = M m (9 {t) ) is also in Af is 

(14) Pr{6»( t+1 ) G Af I (t) = <?} > Pr{||6»( t+1 ) - Af B M(6» (i) )|| < $2 | (t) = <?} 

by the definition of 62 ■ Denote a lower bound on the right hand side of (14) by p = 
p($2 , m) > and note that (i) p can be chosen not to depend on the value of 9 G Af 
by the compactness of Af and the uniformity of convergence A4 m (9) —> Mem(9) 
over compact subsets of 0; and (ii) for fixed 62, p(d2,m) —¥ 1 as m — > 00. 
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Consider running a Monte Carlo EM algorithm for To updates. For any starting 
value 6K°) e TV, 



and since each Monte Carlo EM update is calculated independently, the right hand 
side of (15) is bounded below by p($2, m) T ° . 

Now, suppose that 0(°) € TV, and that ||0( t+1 ) - M EM (9 (t) )\\ < 6 2 for each t, 
and thus 9^ € TV for each t. Suppose to get a contradiction that \\9^ — 0*|| > £i, 
that is, that G TVi for each t = 0, 1, . . . ,T . Then Z(6»^ +1 >;y) - l(9 {t) ;y) > Si 
for each £ = 0, 1, . . . , T - 1, and thus Z(0< T °); y) - Z(0 (o) ; y) > SiT > R. But that 
contradicts (13), the definition of R, since 0(°) and 0( T °) are both in TV. 

Thus it must be that if 0<°) e TV and ||6»( t+1 ) - M EM {9 {t) )\\ < S 2 for each t, 
then ||0W - 0*|| < £j < e for some i, which occurs with probability not less than 
p(S2, m) T ° , which converges to 1 as m — > oo. □ 

A couple of remarks are in order. First, we note that the assumptions of Theorem 
5 are slightly different than those made by Chan and Lcdoltcr (1995) in that where 
we assumed uniform convergence of the Monte Carlo EM update, Chan and Ledolter 
(1995) assumed conditions on the form of the log-likelihood sufficient to guarantee 
it. Secondly, the conclusion of Theorem 5, while interesting, is unsatisfying in at 
least one respect: It does not guarantee the convergence of an MCEM sequence 
in any meaningful sense. Practically, what this theorem tells us is that if you run 
the algorithm long enough (at least T iterations), the resulting sequence will, with 
high probability, at some point get arbitrarily close to the MLE. But to an analyst 
examining the output of an MCEM run, even a very long one, there is no way to 
know when that has happened, if at all. A more powerful result would be one that 
specifies conditions under which the algorithm gets close to the MLE and stays 
there. 

4.2. A result of Fort and Moulines (2003) 

Fort and Moulines (2003) used the ergodic theory of Markov chains to prove the 
almost sure (a.s.) convergence of a variation of the Monte Carlo EM algorithm. We 
will state their assumptions and main conclusion; the proof is highly technical and 
beyond the scope of this report. 

We will state Fort and Moulines (2003) convergence result assuming that the 
Monte Carlo E-step is accomplished by i.i.d. sampling. In fact the result holds 
more generally under Markov chain Monte Carlo methods, assuming the underlying 
Markov transition kernel is uniformly ergodic (see, for example, Jones and Hobert, 



Fort and Moulines (2003) consider a variation of Monte Carlo EM they call stable 
MCEM, which we define here. Let {K-t ■ t = 0, 1, 2, . . .} be a sequence of compact 
subsets of satisfying 



(15) 




2001). 



OO 



(16) 



ICt C /Ct+i for each t, and 
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Set po = and choose 6K°) £ /C . Given 6> (t) and p f , the stable MCEM update rule 
for 6^* +1 ) and pt+i is given by 

1. Let 9' be the ordinary MCEM update as defined in Section 1. 

2. If 9' £ lC pt , then 6K t+1 ) = 9' and p t+1 = p t . 

If 9' <£ tC pt , then 6»( t+1 ) = and p t+1 = p t + 1. 

Thus in stable MCEM, any time the ordinary MCEM update falls outside a specific 
set, the algorithm is reinitialized at the point 9^; pt counts the cumulative number 
of reinitializations as of update t. Fort and Moulincs (2003) showed that under 
appropriate assumptions (see Theorem 6 below), {p t } is a.s. finite. 

We will assume that the complete data model f(y, u; 9) is from the class of curved 
exponential families: Let y C 1* denote the range of Y and WcI' the range of U. 
We assume that for some integer k there exist functions (j> : — > Mr, tp : — ► M k , 
and S : y x U -> S C R k such that 

l c (9; y, u) = log f(y, u; 9) = ^(6) T S(y, u) + <p(6) . 

Since l c depends on (y, u) only through s = S(y, u) we can write l c (9; s) = ip(9) T s + 
4>(9). Note that the curved exponential families include the linear mixed model of 
Example 1 in Section 2, but not the logit-normal GLMM of Example 2. 
We will further assume that 

1. 4> and ijj are continuous on 0, S is continuous on ^ x U; 

2. for all 9 6 0, S(9;y) :— E{S(y, U) \ y\ 9} is finite and continuous on 0; 

3. there exists a continuous function : 6> — !• such that for all s E S, 
l c (9(s);s) = sup ee0 l c (9;s); 

4. the observed data log-likelihood 1(9; y) is continuous on 0, and for any A, the 
level set {9 € : 1(9; y) > A} is compact; 

5. the set of fixed points of the EM algorithm is compact. 

Let L denote the set of fixed points of the EM algorithm; in a curved exponential 
family, and using the notation introduced above, F = |(9 e : 9(S(9;y)) = d\. 

As shown by Wu (1983, Theorem 2), under the above assumptions, if is open 
and <j) an d ip are differentiable on 0, then 1(6; y) is differentiable on and F = 
{9 e : Wl(9;y) — 0}. In other words, the set of fixed points of the EM algorithm 
coincides with the set of stationary points of the log-likelihood 1(6; y); see also our 
Theorem 2. 

Finally, note that assumptions 4. and 5. guarantee that the set {l(6;y) : 6 £ T} 
is compact as well. We can now state Fort and Moulines's (2003) main result. We 
will denote the closure of a sequence by Cl(-), so that Cl({6> (t) }) represents the 
union of the sequence itself with its limit points. 

Theorem 6. (Theorem 3 of Fort and Moulines, 2003) Assume the complete data 
model is from the class of curved exponential families, and the model satisfies as- 
sumptions 1. through 6. above. Consider an implementation of the stable MCEM 
algorithm using a sequence of sets {K-t} satisfying (16). Let 6^ £ /Co and suppose 
the Monte Carlo sample sizes {m t } satisfy X^t^o 771 *^ 1 < 00 • Then 

1. (a) lim t _ i . 00 p t < oo with probability 1 (w.p. 1) and limsup^^ < oo 

w.p. 1; 

(b) {l(6^;y)} converges w.p. 1 to a connected component of l(T;y) where 
T denotes the set of stationary points of 1(6; y) (and fixed points of the 
EM algorithm) . 
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2. If, in addition, {1(6; y) : 9 € Tn C1({#W})} has an empty interior, then {l(9^; y)} 
converges w.p. 1 to a point A* and {$(*H converges to the set {9 : 1(9; y) = A*}. 

It is often the case that the set T is made up of isolated points; the above theorem 
then guarantees pointwise convergence of {l(9^';y)} to a stationary point oil(9;y). 
If T consists of a single point 9, the theorem guarantees that 1(9^; y) — > 1(9; y) w.p. 
1 and 9^ — > 9 w.p. 1, analogous to Corollary 1. 

Finally, we note that the assumption that ^ vn~[ 1 < oo can be weakened in many 
instances, but is necessarily of the form ^ m t p < oo for some p > 1. 

5. Remarks: Lessons for the (MC)EM practitioner 

We conclude with a brief discussion of the practical implications of the convergence 
results of Sections 3 and 4. First, as we noted in our discussion following Theorem 
2, even when EM converges, there is no guarantee in general that it has converged 
to a global maximum. In more complex settings such as mixture models, or model- 
based clustering, the likelihood function may have multiple optima, most of which 
will be local optima. While the EM algorithm may converge, its limit point is 
sub-optimal. Solutions to overcome local optima can include merging the ideas of 
the EM algorithm with those of global optimization. One example is described in 
the paper by Heath, Fu and Jank (2009) who combine EM with the cross-entropy 
method and model reference adaptive search, two global optimization heuristics. 
Another example can be found in Tu, Ball and Jank (2008), who combine the EM 
algorithm with the genetic algorithm to model flight delay distributions. 

With respect to Monte Carlo EM, as we have previously noted, the Monte Carlo 
sample size must be increased with the iteration count; otherwise there is no chance 
for convergence in the usual sense, due to the persistence of Monte Carlo error. The 
convergence results of section 4.2 (Fort and Moulines, 2003) require J2 m 7 1 < 00 ■ 
Intuitively it makes sense to start the algorithm with modest simulation sizes: 
when the parameter value is relatively far from the MLE, the (deterministic) EM 
update makes a substantial jump, and less precision is required for the Monte Carlo 
approximation to that jump. When the parameter value is close to the MLE, as 
will be the case after a number of iterations, the EM update is a small step, and 
greater precision is required for the Monte Carlo approximation. 

Thus it is clear that mt must be an increasing function of t, though it is not at all 
clear what might be an appropriate form. In fact there exists a literature, beginning 
with Booth and Hobcrt (1999), on automated Monte Carlo EM algorithms, in which 
the simulation size for each Monte Carlo E-step is determined internally to the 
algorithm, based on some rule for assessing the level of precision required for the 
Monte Carlo approximation at hand. Other authors who have contributed to this 
literature include Levine and Casella (2001) and Caffo, Jank and Jones (2005). 

One can view the Monte Carlo EM update to the parameter value 9^ as an 
estimate of the deterministic EM update Mem(9^)- In Booth and Hobert's (1999) 
algorithm, each MCEM update requires the computation of an asymptotic confi- 
dence region for Mem(9^) in addition to the point estimate M. mt (9^)- If 9^> 
falls within this confidence region, we must accept that the current parameter 
value 9^ is statistically indistinguishable from its EM update Mem(9^)- This 
suggests that the MCEM update was "swamped by Monte Carlo error," and thus 
the simulation size must be increased at the next iteration. The reader is referred 
to Booth and Hobcrt (1999) for details and examples. Levine and Casella (2001) 
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use a regeneration-based approach to Monte Carlo standard errors in computing 
their confidence region. 

The Ascent-based Monte Carlo EM algorithm of Caffo, Jank and Jones (2005) 
seeks to prevent the MCEM update from being swamped by Monte Carlo error by 
successively appending the Monte Carlo sample until one has a pre-specified level 
of confidence that the proposed update increases the log-likelihood over the current 
parameter value, that is, until we are confident that indeed Z(6»( t+1 ); y) > l(6& ;y). 
Recall that this ascent property is guaranteed for ordinary EM (Theorem 1). Since 
the MCEM update maximizes an estimate of the Q-function rather than the Q- 
function itself, there is no ascent property for MCEM in general. But a parameter 
update computed according to the Ascent-based MCEM rule will increase the log- 
likelihood with high probability. Again the reader is referred to the source (Caffo, 
Jank and Jones, 2005) for details. Empirical comparisons between Ascent-based 
MCEM and Booth and Hobert's (1999) algorithm can be found in Caffo, Jank and 
Jones (2005) and Neath (2006). 

A second practical implication of the convergence properties of Monte Carlo 
EM relates to convergence criteria, or stopping rules for the algorithm. At what 
point should the MCEM iterations be terminated and the current parameter value 
accepted as the MLE? The usual stopping rules employed in a deterministic iterative 
algorithm like ordinary EM terminate when it is apparent that further iterations 
(i) will not substantively change the approximation to the MLE, or (ii) will not 
substantively change the value of the objective (likelihood) function. For example, 
one might terminate at the first iteration t to satisfy 



for user-specified 8 and e, where the maximum is taken over components of the 
parameter vector. In Monte Carlo EM, such criteria run the risk of terminating too 
early, as (17) may be attained only because of Monte Carlo error in the update. An 
obvious but inelegant solution is to terminate only after (17) is met for, say, three 
consecutive iterations. This is the stopping rule recommended by Booth and Hobert 
(1999). Other MCEM stopping rules considered in the literature include Chan and 
Ledolter's (1995) suggestion to terminate at the first iteration where 1(6^' \y) — 
£(#(* _1 );y) is stochastically small; in a similar vein Caffo, Jank and Jones (2005) 
terminate when an asymptotic upper bound on Q(9^\9^ 1 '; y) — Q{9^ 1 ^\0^ t ~ is> ; y) 
falls below a pre-specified tolerance. 

Finally, we note that while our focus throughout has been on finding a good 
approximation to the MLE, meaningful statistical inference requires at minimum 
a reliable estimate of the standard error as well. A formula in Louis (1982) ex- 
presses the observed Fisher Information as an expectation taken with respect to 
the conditional distribution of the unobserved data given the observed data. Thus 
a Monte Carlo approximation to the inverse covariance matrix of the MLE is read- 
ily available from the simulation already conducted to compute the final MCEM 
update. 
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